Project Author: Alexander Lacson
Data Source: Acquired from Codecademy, which says the data is "from the World Health Organization and the World Bank".
This project explores and analyzes a single dataset containing life expectancy and GDP for six countries. The analysis gives us insight into two questions:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
df = pd.read_csv("all_data.csv")
df.replace('United States of America', 'USA', inplace = True)
import plotly.express as px
locations = ['USA', 'CHL', 'CHN', 'DEU', 'MEX', 'ZWE']
fig = px.choropleth(locations, locations=locations, color=locations)
fig.show()
The six countries in the set are spread out over different continents and regions of the world.
plt.figure(figsize = (8, 5))
sns.lineplot(data = df, x = 'Year', y = 'Life expectancy at birth (years)', hue = 'Country')
plt.title("Life Expectancy from 2000 to 2015 by Country")
plt.legend(df.Country.unique(), bbox_to_anchor=(1,1), loc = 'upper left', frameon = False)
plt.show()
Over time we see a steady increase in life expectancy for all countries. Zimbabwe is separate from the rest of the group but its steep upward slope shows that it is quickly catching up.
sns.boxplot(data = df, x = 'Country', y = 'Life expectancy at birth (years)')
plt.title('Life Expectancy by Country')
plt.show()
Zimbabwe shows lower life expectancy and higher variation than the other countries in the dataset. The wide spread reflects its steep change over time, as seen in the previous figure.
plt.figure(figsize = (8, 5))
sns.lineplot(data = df, x = 'Year', y = 'GDP', hue = 'Country')
plt.title("GDP from 2000 to 2015 by Country")
plt.legend(df.Country.unique(), bbox_to_anchor=(1,1), loc = 'upper left', frameon = False)
plt.show()
The economies of the USA and China grow much faster than those of the other countries in the dataset.
sns.boxplot(data = df, x = 'Country', y = 'GDP')
plt.title('GDP by Country')
plt.show()
We can see that the US has a significantly larger economy than the other countries. We can also see that the economies of China and the US are undergoing relatively great change (growth over time, as can be seen in the previous figure) compared to the other countries.
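# Pearson correlation between GDP and life expectancy, computed separately for each country.
# Series.corr is used because DataFrame.corr() raises on the text 'Country' column in pandas 2.0+.
GDP_life_corr = [[each, df[df.Country == each]['GDP'].corr(df[df.Country == each]['Life expectancy at birth (years)'])] for each in df.Country.unique()]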
gdp_life_corr_df = pd.DataFrame(GDP_life_corr, columns = ['Country', 'GDPvsLE correlation'])
gdp_life_corr_df_sorted = gdp_life_corr_df.sort_values(by = 'GDPvsLE correlation', ascending = False).reset_index(drop = True)
###
sns.set_style("white")
grid = sns.FacetGrid(df, col="Country", col_wrap=3, sharex = False, sharey = False, height=3.5, aspect = 1)
grid.map(sns.regplot, "GDP", 'Life expectancy at birth (years)')
for ((_, ax), corr) in zip(grid.axes_dict.items(), gdp_life_corr_df['GDPvsLE correlation'].tolist()):
    ax.text(0.05, 0.9, "r-score = " + "{:.2f}".format(corr), horizontalalignment='left',
            verticalalignment='center', transform=ax.transAxes)
plt.show()
The r-scores (Pearson correlation coefficients) tell us that, within each country, there is a very strong positive correlation between life expectancy and GDP. The life expectancy of a country's citizens rises and falls closely with that country's own GDP over this period. Keep in mind that the r-score only tells us the strength of the linear association between GDP and life expectancy. It does not tell us how much of an effect GDP is having on life expectancy, and a strong correlation does not by itself show that GDP is the cause.
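To make the distinction between strength of association and size of effect concrete, here is a minimal sketch using synthetic numbers (not values from all_data.csv). The variable names are illustrative assumptions. Both series have the same r-score, yet one slope is 100 times larger than the other.

```python
import numpy as np

# Synthetic, illustrative data only (not taken from all_data.csv).
x = np.arange(1.0, 11.0)
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1])

y_steep = 5.0 * x + 5.0 * noise      # large change in y per unit of x
y_shallow = 0.05 * x + 0.05 * noise  # change in y is 100 times smaller

# Pearson r measures how tightly the points follow a straight line.
r_steep = np.corrcoef(x, y_steep)[0, 1]
r_shallow = np.corrcoef(x, y_shallow)[0, 1]

# The fitted slope measures how much y changes per unit of x.
slope_steep = np.polyfit(x, y_steep, 1)[0]
slope_shallow = np.polyfit(x, y_shallow, 1)[0]

print(f"steep:   r = {r_steep:.3f}, slope = {slope_steep:.3f}")
print(f"shallow: r = {r_shallow:.3f}, slope = {slope_shallow:.3f}")
```

Because Pearson's r is unchanged when a variable is rescaled by a positive constant, it cannot say anything about the size of the slope. That is why the slopes are examined separately later in this notebook.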
from sklearn.linear_model import LinearRegression
import plotly.graph_objects as go
import numpy as np
fig = px.scatter(df[df['Country'] == 'China'], x='GDP', y='Life expectancy at birth (years)', color = 'Year',
template='plotly_white',
title = 'GDP vs Life Expectancy of China')
X = df[df['Country'] == 'China']['GDP'].iloc[0:5].values.reshape(-1, 1)
y = df[df['Country'] == 'China']['Life expectancy at birth (years)'].iloc[0:5].values.reshape(-1, 1)
reg = LinearRegression().fit(X, y)
fig.add_trace(go.Scatter(x = X.flatten().tolist(), y = reg.predict(X).flatten().tolist(), mode = 'lines', name = 'China',
showlegend =False, hoverinfo = 'none'))
X = df[df['Country'] == 'China']['GDP'].iloc[5:].values.reshape(-1, 1)
y = df[df['Country'] == 'China']['Life expectancy at birth (years)'].iloc[5:].values.reshape(-1, 1)
reg = LinearRegression().fit(X, y)
fig.add_trace(go.Scatter(x = X.flatten().tolist(), y = reg.predict(X).flatten().tolist(), mode = 'lines', name = 'China',
showlegend =False, hoverinfo = 'none'))
fig.show()
Zooming in on China, we can see that its data is better described by two GDP vs Life Expectancy regression lines than by one. Going by the row slices used above (iloc[0:5] and iloc[5:], with the data starting in 2000), the first line covers 2000-2004 and has a steeper slope, and the second covers 2005-2015 and is much closer to horizontal. This seems to act like a 'missing link' in our data, suggesting that at some point a country's GDP vs Life Expectancy relationship transitions from a steep line to a flat one.
The r-score calculated earlier over all of China's data was 0.91. The r-score calculated for the second segment alone is 0.98.
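The jump in r-score when only the later segment is used is what one would expect from a kinked relationship. The sketch below uses synthetic data (an assumption for illustration, not the China figures): a steep first segment followed by a flat one gives a lower r over the whole range than over the flat segment alone.

```python
import numpy as np

# Synthetic kinked relationship (illustrative only, not the China data):
# slope 2.0 for the first 5 points, slope 0.2 for the remaining 11.
x = np.arange(16, dtype=float)
y = np.where(x < 5, 2.0 * x, 8.0 + 0.2 * x)

r_all = np.corrcoef(x, y)[0, 1]            # one straight line through everything
r_late = np.corrcoef(x[5:], y[5:])[0, 1]   # only the flatter second segment

print(f"r over all points:      {r_all:.2f}")
print(f"r over the late points: {r_late:.2f}")
```

A single r-score over the full range mixes two different regimes, so it understates how linear each regime is on its own.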
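# Recompute China's correlation from the second segment only (rows iloc[5:]),
# and look China up by name instead of assuming it sits at row 5 of the sorted table.
china_late = df[df.Country == 'China'].iloc[5:]
gdp_life_corr_df_sorted.loc[gdp_life_corr_df_sorted.Country == 'China', 'GDPvsLE correlation'] = china_late['GDP'].corr(china_late['Life expectancy at birth (years)'])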
gdp_life_corr_df_sorted = gdp_life_corr_df_sorted.sort_values(by = 'GDPvsLE correlation', ascending = False).reset_index(drop = True)
sns.set_style("whitegrid")
sns.set_context("notebook")
figure = sns.barplot(data = gdp_life_corr_df_sorted, x = 'Country', y = 'GDPvsLE correlation')
figure.axis(ymin = 0.89, ymax = 1.0)
plt.title("Association between GDP and Life Expectancy by Country")
# The China value comes from the rows in iloc[5:], which start at 2005 if the data begins in 2000.
figure.text(0.3, 0.95, '*China using r-score for data 2005 onwards', horizontalalignment='left',
            verticalalignment='center', transform=figure.transAxes)
plt.show()
Not all countries have the same magnitude of correlation between life expectancy and GDP. It could mean that in some countries the health of citizens is less dependent on how well the economy is doing than it is in others. Perhaps climate, food, environment, politics, and geography have a larger influence there.
fig = px.scatter(df, x='GDP', y='Life expectancy at birth (years)', color ='Country', template='plotly_white',
title = 'GDP vs Life Expectancy of Countries Together on a Single Plot')
slopes = []
for country in df.Country.unique():
    X = df[df['Country'] == country]['GDP'].values.reshape(-1, 1)
    y = df[df['Country'] == country]['Life expectancy at birth (years)'].values.reshape(-1, 1)
    reg = LinearRegression().fit(X, y)
    fig.add_trace(go.Scatter(x = X.flatten().tolist(), y = reg.predict(X).flatten().tolist(), mode = 'lines', name = country,
                             showlegend =False, hoverinfo = 'none'))
    fig.update_traces(legendgroup = country, selector = dict(name=country))
    slopes.append((reg.coef_[0][0], country))
fig.update_layout(legend=dict(
orientation="h",
yanchor="bottom",
y=1.02,
xanchor="right",
x=1
))
fig.show()
The above plot is interactive.
Let us focus our attention on the slopes of the regression lines, which differ by country. The slope estimates how much life expectancy changes for each unit change in GDP. As we move from left to right on the plot, we can clearly see the slopes shift from nearly vertical towards nearly horizontal. This can be understood as 'Countries with low GDP see great gains in life expectancy for every increase in GDP, but as countries start to have larger GDPs, they have to acquire more GDP to get the same effect on life expectancy'. It seems to bear some resemblance to the law of diminishing returns.
This is a reason to give support to developing nations. The small economic improvements they make go a long way towards improving lives. This would justify institutions like the World Bank, an international organization that offers developmental assistance to middle-income and low-income countries. [5]
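The diminishing-returns reading can be illustrated with a small worked example. The sketch below assumes a logarithmic relationship between GDP and life expectancy, which is one common way to express diminishing returns and is not something fitted to this dataset. The function name and the coefficients a and b are hypothetical.

```python
import math

def life_expectancy(gdp, a=20.0, b=2.0):
    """Hypothetical log model: a + b * ln(gdp). The coefficients are made up."""
    return a + b * math.log(gdp)

increase = 1e10  # the same absolute GDP increase applied to both economies

# Small economy: GDP doubles from 1e10 to 2e10.
gain_low = life_expectancy(1e10 + increase) - life_expectancy(1e10)

# Large economy: GDP grows by only 0.1 percent, from 1e13 to 1.001e13.
gain_high = life_expectancy(1e13 + increase) - life_expectancy(1e13)

print(f"gain for the small economy: {gain_low:.3f} years")
print(f"gain for the large economy: {gain_high:.3f} years")
```

Under this assumed model the same absolute increase in GDP buys far more life expectancy for the small economy, which is the pattern the flattening slopes suggest.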
For predictive statistical modelling, considering the diminishing effect seen across different countries, it could be worth fitting a higher-order curve rather than a straight line to the data. It could also be worth investigating whether there is a "phase-transition" point, where the slope suddenly changes in a piecewise fashion. This is most evident in the plot of China, where there appears to be a transition from a steep slope to a flat slope.
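Both ideas can be sketched on synthetic data. The code below is an illustration under stated assumptions, not the method used in this notebook: the helper names sse_of_line and best_breakpoint are hypothetical, the data is made up, and the breakpoint search is a simple brute-force scan that fits two independent straight lines for every candidate split and keeps the split with the lowest total squared error.

```python
import numpy as np

def sse_of_line(x, y):
    """Sum of squared residuals of a least-squares straight line through (x, y)."""
    coeffs = np.polyfit(x, y, 1)
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

def best_breakpoint(x, y, min_points=3):
    """Return the split index k that minimises SSE(x[:k]) + SSE(x[k:])."""
    candidates = range(min_points, len(x) - min_points + 1)
    return min(candidates,
               key=lambda k: sse_of_line(x[:k], y[:k]) + sse_of_line(x[k:], y[k:]))

# Synthetic kinked data (illustrative only): slope 2.0 for the first 5 points,
# slope 0.2 for the remaining 11.
x = np.arange(16, dtype=float)
y = np.where(x < 5, 2.0 * x, 8.0 + 0.2 * x)

k = best_breakpoint(x, y)
sse_linear = sse_of_line(x, y)
sse_quadratic = float(np.sum((y - np.polyval(np.polyfit(x, y, 2), x)) ** 2))

print("best split index:", k)
print(f"SSE straight line: {sse_linear:.2f}, SSE quadratic: {sse_quadratic:.2f}")
```

On real data a lower error for the more flexible model is expected by construction, so a fair comparison would also need a penalty for the extra parameters or a held-out set.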
DON'T PANIC - Hans Rosling showing the facts about population. A highlight of the talk, which ties in to the points made here, comes at the 36:00 mark, where Hans Rosling begins explaining how a relatively small amount of money makes a large difference for people in the developing world.